This file describes the preliminary analyses of three test-concepts in the QLVLnewscorpora: penis, inleiding & hart. The concepts were selected from the full list of concepts (N = 433) that I collected from WordNet, Van Dale and DLP2. Information about the full set of concepts is available here:
Parameter selection was based on observations in Mariana's analyses of nouns & verbs, as well as comments in the parameters google doc. At this moment, the following parameter settings were used to construct token models:
| parameter name | FOC | SOC |
|---|---|---|
| Definition target type | lemma/pos | lemma/pos |
| Window size | fixed: 10 | fixed: 4 |
| Boundaries | sentence/none | none |
| cw selection: strategy | local/global | global |
| cw selection: settings | local: nav with freq > 200, collfreq = 3, ppmi > 1, llr None or > 1; global: nav top-5000 | nav top-5000 |
| Weighting | ppmi | none |
Of these, I plan to vary the boundaries (default: sentence) and the context word selection settings for FOC's. Specifically, I will compare implementing an LLR-filter or not within the "local"1 strategy, as well as a local versus a global2 strategy. In the latter case, all top-5000 nav context words will be considered.
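The resulting comparison is a small grid of model specifications. As a minimal sketch (in Python rather than the R-based NephoSem tooling; all names are illustrative), the planned FOC variations can be enumerated like this:

```python
from itertools import product

# Hypothetical sketch of the FOC parameter grid described above:
# boundaries (sentence vs. none) crossed with context-word selection
# (local with or without an LLR filter, or global top-5000 nav).
boundaries = ["sentence", "none"]
cw_selection = [
    {"strategy": "local", "llr_filter": None},
    {"strategy": "local", "llr_filter": 1.0},
    {"strategy": "global", "nav_top": 5000},
]

model_specs = [
    {"boundaries": b, **cw} for b, cw in product(boundaries, cw_selection)
]

for spec in model_specs:
    print(spec)
```

This yields six FOC model specifications (2 boundary settings × 3 context-word selection settings).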
This concept was selected because it's a difficult one, with many variables (N = 17, excluding constructions) and varying frequencies per variable.
| variant | frequency |
|---|---|
| ding/noun | 80601 |
| fluit/noun | 1447 |
| jongeheer/noun | 105 |
| lid/noun | 107912 |
| lul/noun | 1155 |
| mannelijkheid/noun | 459 |
| penis/noun | 1252 |
| piemel/noun | 372 |
| pik/noun | 451 |
| pisser/noun | 4 |
| plasser/noun | 18 |
| potlood/noun | 1504 |
| sjarel/noun | 6 |
| snikkel/noun | 18 |
| speer/noun | 1217 |
| tampeloeres/noun | 1 |
| zwengel/noun | 42 |
This causes two problems for the models & analysis:
A possible solution for the latter problem is to only sample the relevant tokens for the highly frequent types. This can be done in two ways:
To find a way of extracting context words for the problematic variants, we need a tokenmodel for the non-problematic ones that performs well. The best model would be one that (1) fits the data well (to avoid artificial effects, e.g. regional differences) and (2) has a (relatively) clear semantic region (or branch) where most observations for the target concept are located (precision), while out-of-concept tokens are located somewhere else (recall). As in other studies in the NephoSem project, determining which model is best is not straightforward. There are a number of procedures that can be considered:
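The precision/recall criterion above can be made concrete. As a sketch (Python stand-in for the R workflow; the token labels are toy data, not the actual annotations), given a candidate semantic region and a gold in-concept/out-of-concept labeling:

```python
import numpy as np

def region_precision_recall(in_region, in_concept):
    """Precision/recall of a candidate semantic region:
    precision = share of tokens in the region that belong to the
    target concept; recall = share of target-concept tokens the
    region captures."""
    in_region = np.asarray(in_region, dtype=bool)
    in_concept = np.asarray(in_concept, dtype=bool)
    tp = np.sum(in_region & in_concept)
    precision = tp / in_region.sum() if in_region.sum() else 0.0
    recall = tp / in_concept.sum() if in_concept.sum() else 0.0
    return precision, recall

# Toy example: 4 tokens fall in the region, 3 of them on-concept,
# and there are 5 on-concept tokens overall.
in_region  = [1, 1, 1, 1, 0, 0, 0, 0]
in_concept = [1, 1, 1, 0, 1, 1, 0, 0]
p, r = region_precision_recall(in_region, in_concept)
print(p, r)  # 0.75 0.6
```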
So far, models with and without sentence bounds have been constructed according to the following parameters. All the tokenmodels have the following settings:
| parameter name | FOC | SOC |
|---|---|---|
| Definition target type | lemma/pos | lemma/pos |
| Window size | fixed: 10 | fixed: 4 |
| Boundaries | sentence/none | none |
| cw selection: strategy | local | global |
| cw selection: settings | local: nav with freq > 200, collfreq = 3, ppmi > 1, llr None | nav top-5000 |
| Weighting | ppmi | none |
You can find a shiny-app to explore the models that have been analyzed so far here.
The following models only consider context words within the same sentence.
The t-SNE-solutions additionally vary according to two parameters:
Overall, it looks like the more stable models are the ones with more runs and perplexity 30. Models with very low perplexity (perplexity = 10) look like they have too many small clusters. Choosing other settings than 'lemma' for the colors in the model plot shows that none of the lectal variables in the data seem to play a role.
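The perplexity comparison above can be sketched as follows (a Python stand-in using scikit-learn's t-SNE on random toy vectors; the original solutions were of course computed on the actual token vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the token-level vectors: 100 random 20-dimensional points.
X = rng.normal(size=(100, 20))

# The runs described above varied perplexity (10/20/30) and the number of
# iterations; here only perplexity is varied, with sklearn's default
# iteration count, to keep the sketch short.
solutions = {}
for perplexity in (10, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    solutions[perplexity] = tsne.fit_transform(X)

print({p: s.shape for p, s in solutions.items()})
```

Low-perplexity solutions tend to fragment into many micro-clusters, which matches the observation above.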
I have tried four NMDS-solutions so far:
The first NMDS-solution is really bad, with a high stress value (> 0.28), and it did not converge. The second solution ran for over twelve hours and was only at trial 178, with stress values comparable to the first solution (at this point, I killed the process). The third and fourth solutions are the best ones so far, with in both cases a stress value of 0.1334 (for the same trial), but still no convergence. The second dimension may be the one we're after, but it is not the case that all variants with the target meaning are at the bottom of the plot, nor that all out-of-concept variants are at the top.
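For reference, non-metric MDS with random restarts can be sketched as follows (a Python approximation using scikit-learn rather than `vegan::metaMDS`; note that sklearn's `stress_` is raw stress, not the normalized Kruskal stress that `metaMDS` reports, so the values are not directly comparable to those above):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))   # stand-in for the token vectors
D = pairwise_distances(X)       # token-by-token dissimilarities

# Non-metric MDS in two dimensions; n_init plays the role of metaMDS's
# random starts (more starts = better chance of a good local optimum).
nmds = MDS(n_components=2, metric=False, n_init=4,
           dissimilarity="precomputed", random_state=0)
coords = nmds.fit_transform(D)
print(coords.shape, nmds.stress_)
```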
Since we're running into problems with the NMDS-models, I analyzed where the problematic tokens are located. I used goodness() from library(vegan) to obtain a goodness-of-fit value per token:
> goodness() finds a goodness of fit statistic for observations (points). This is defined so that sum of squared values is equal to squared stress. Large values indicate poor fit.
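The idea behind this statistic can be sketched in a few lines (a Python approximation of the general principle, not vegan's exact implementation, which works on the monotone-regression fit): split the squared stress over observations so that each point's share of the residuals becomes its goodness value.

```python
import numpy as np

def per_point_goodness(d_obs, d_fit):
    """Per-point goodness of fit in the spirit of vegan::goodness():
    the squared stress is split over observations so that the sum of
    squared per-point values equals the squared stress; large values
    indicate poorly fitted points."""
    resid2 = (np.asarray(d_obs) - np.asarray(d_fit)) ** 2
    denom = (np.asarray(d_fit) ** 2).sum()
    stress2 = resid2.sum() / denom          # squared (Kruskal-style) stress
    point_g2 = resid2.sum(axis=1) / denom   # each point's share
    return np.sqrt(point_g2), np.sqrt(stress2)

# Toy 3-point example: d_obs are the (monotone-transformed) input
# dissimilarities, d_fit the distances in the NMDS configuration.
d_obs = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
d_fit = np.array([[0.0, 1.2, 1.7], [1.2, 0.0, 0.9], [1.7, 0.9, 0.0]])
g, stress = per_point_goodness(d_obs, d_fit)
print(g.round(3), round(stress, 3))
```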
This plot shows the results for the fourth NMDS-solution. It shows that the less problematic tokens (lighter colours) are located at the top left of the plot, where the observations for fluit and potlood are located (typically with their prototypical meaning), as well as in the middle, with the tokens for penis. Perhaps the model has more trouble with variants that are more polysemous.
Finally, I also used agglomerative hierarchical clustering (Ward's method) to analyze these data. Rather than choosing a number of clusters beforehand, I considered between 2 and 50 clusters, basing the optimal number of clusters on the silhouette width of the clusters. The optimal number of clusters is 45 in these data (sw = 0.358), with solutions that have 15 clusters or more reaching acceptable results (sw > 0.2). Obviously, solutions with 15 or more clusters are difficult to interpret, but for the purpose of illustration, this plot shows the solution with 15 clusters (isolate one cluster by double-clicking on its symbol in the legend). The x- and y-axis show the results from the t-SNE-solution with perplexity = 30 and 5000 runs. Some of the clusters make a lot of sense, especially when they are also the ones that are separated by t-SNE (e.g. the ouwe lul-cluster at the right of the plot in magenta). Others are more diverse (e.g. clusters 2 and 5).
With fewer clusters, only some of the clearer divisions are (obviously) retained. Cluster 3 in the solution above, for instance, has body parts as context words. In the solution below, it is added to the more diverse cluster 2.
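The clustering-plus-silhouette selection described above can be sketched as follows (a Python stand-in using scipy/sklearn on toy data; the actual analysis was done in R and scanned 2-50 clusters, here shortened to 2-10):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Stand-in for token coordinates: three artificial, well-separated groups.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 5))
               for c in (0.0, 3.0, 6.0)])

# Ward's agglomerative clustering on the raw observations.
Z = linkage(X, method="ward")

# Scan candidate cluster numbers and keep the best average silhouette
# width, mirroring the procedure described above.
best_k, best_sw = None, -1.0
for k in range(2, 11):
    labels = fcluster(Z, t=k, criterion="maxclust")
    sw = silhouette_score(X, labels)
    if sw > best_sw:
        best_k, best_sw = k, sw

print(best_k, round(best_sw, 3))
```

On these toy data the scan recovers the three planted groups; on the real token data the optimum is much higher (45 clusters), which is typical for fine-grained token clouds.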
The following models also consider context words outside of the sentence.
The t-SNE-solutions again vary according to two parameters:
It looks like the models have clearer clusters from perplexity = 30 onwards (both for 1000 and 5000 runs). Models with very low perplexity (perplexity = 10/20) again look like they have too many small clusters. Interestingly, with 1000 runs and perplexity 30, the 'oude lul'-cluster is moved far away (left of the plot) from the uses of 'lul' referring to the target concept (right side of the plot). All in all, this looks like a very good solution, with the target concept mostly in the bottom right quadrant of the plot (though it's not completely perfect).
Choosing other settings than 'lemma' for the colors in the model plot shows that none of the lectal variables in the data seem to play a role.
In contrast with the first model, I no longer constructed an NMDS-solution for k = 2 dimensions, as this caused problems. The NMDS-solutions again throw a convergence error after finishing. This indicates that you can't be sure that the solution it found is a global optimum rather than a local one. It may be necessary to change more parameter settings (specifically the convergence criteria), as the metaMDS vignette states:
> In addition to too slack convergence criteria and too low number of random starts, wrong number of dimensions (argument k) is the most common reason for not finding convergent solutions.
The solutions for nmds nruns = 100 and 250 look identical. This may have to do with the fact that they choose the same run as the best solution, but I am re-running them right now to make sure there are no other errors. The relevant tokens are in the top part of the plot, although the clusters are not perfect.
For the hierarchical clustering algorithm, I again considered between 2 and 50 clusters, basing the optimal number of clusters on the silhouette width of the clusters. The optimal number of clusters is 49 in these data (sw = 0.332), with solutions that have 16 clusters or more reaching acceptable results (sw > 0.2). Again, this plot shows the solution with 16 clusters (isolate one cluster by double-clicking on its symbol in the legend). The x- and y-axis show the results from the t-SNE-solution without sentence bounds with perplexity = 30 and 1000 runs. In most cases, context words explain why certain clusters are formed (e.g. cluster 12: tokens having to do with measurements --> for speer these are mostly tokens related to sports; cluster 14 --> small/large).
To summarize, the workflow so far consists of the following steps:
The end goal of the current analysis is to find a set of candidate context words to sample tokens for the words that are currently not included (e.g. ding, lid), even though they are also synonyms for our target concept. So the question at this moment is: do we have enough information to make an informed choice about the context words that need to be present in the context of tokens included in this sample?
I think that we do: in my opinion, we can combine the best t-SNE solution for each model with the clustering result (possibly the 15- and 16-cluster solutions from the two models above) to select tokens that most likely refer to the target concept and then extract the (most frequent) context words from them. The process can take the following form:
I expect that we may find a relatively large number of 'other' tokens. One solution would be to also take a sample of these and analyze them manually. I don't think that there will be many tokens in this group that do refer to the target concept, unless they are part of a fixed expression. In the latter case, it is an open question whether we should include them in the final analysis, because a variant used in a fixed expression is not necessarily synonymous (or interchangeable) with the other variants in that context.

6. Begin working on the actual token model that will be used in the analysis :)
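The context-word extraction step can be sketched as follows (a Python toy example; the cluster numbers and Dutch context words are purely illustrative, not taken from the actual models):

```python
from collections import Counter

# Hypothetical token records: a cluster label from the hierarchical
# clustering plus the token's context words.
tokens = [
    {"cluster": 3, "context": ["lichaam", "been", "arm"]},
    {"cluster": 3, "context": ["lichaam", "huid"]},
    {"cluster": 7, "context": ["voetbal", "doelpunt"]},
    {"cluster": 3, "context": ["huid", "lichaam"]},
]

# Clusters judged (manually) to refer to the target concept.
target_clusters = {3}

counts = Counter(
    cw
    for tok in tokens
    if tok["cluster"] in target_clusters
    for cw in tok["context"]
)

# The most frequent context words become sampling candidates for the
# excluded high-frequency variants (e.g. ding, lid).
candidates = [cw for cw, n in counts.most_common(2)]
print(candidates)  # ['lichaam', 'huid']
```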
Especially for this next part of the analysis, it will be important to not just find any solution that looks acceptable, but to come up with the best solution (if that exists). This will take two forms.
First, it may be necessary to vary more parameters than just the boundary-parameter in the token model, and to come up with the best possible dimensionality reduction solution. Some of the options have been outlined above. While this is not so difficult to do, the difficult part will be deciding which model is the best one. Some questions we can easily answer are:
How different are the models really and how do they differ?
We can use procrustes analysis to make pairwise comparisons of the models. Some notes:
The `procrustes()` function in `library(vegan)` also comes with a function `protest()`, a permutation test for the significance of the Procrustes statistic. This may be an interesting result to look at.

I've already tried out a procrustes analysis on two solutions for the first and second model described above (t-SNE, perplexity = 30, nruns = 5000). The plot below shows the residuals for the procrustes analysis, with lighter colors indicating a larger difference between model 2 and model 1.
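The pairwise comparison can be sketched as follows (a Python stand-in using `scipy.spatial.procrustes` rather than vegan's implementation; the two configurations here are toy data, one being a rotated, slightly perturbed copy of the other):

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 2))  # stand-in for the model-1 t-SNE solution
# Stand-in for model 2: a 90-degree rotation of A plus a little noise.
B = A @ np.array([[0.0, -1.0], [1.0, 0.0]]) + 0.1 * rng.normal(size=(50, 2))

# procrustes() superimposes the two configurations (translation, scaling,
# rotation); `disparity` is the overall sum of squared differences, and
# the row-wise distances play the role of the per-token residuals
# plotted above.
mtx1, mtx2, disparity = procrustes(A, B)
residuals = np.linalg.norm(mtx1 - mtx2, axis=1)
print(round(disparity, 4), residuals.argmax())
```

Because Procrustes removes rotation and scaling, the remaining residuals isolate genuine configurational differences between the two solutions.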
How stable are the models for other concepts?
Of course, so far I've been focusing on a single concept, even though I selected three test concepts. While I expected this concept to be a pretty difficult one, the results are interpretable. Perhaps it would be useful to construct tokenmodels for the other two concepts before turning to the final model for penis, as this may reveal other problems that are not yet obvious in the current data?
Second, another question that we don't have an answer to yet is how scalable the procedure used so far is or, put differently, how necessary it is to do the intermediate step of constructing a tokenmodel for a subset of the variants. Analyzing the other concepts might shed some more light on this question.
Defined in the google doc as: "potentially all words within the specified window span around the target token". Note that my definition of "local" is not extreme, as I am only including nav's with a frequency of > 200. However, it is local in the sense that potentially all these words can be considered (N = 37807)↩
Defined in the google doc as "fixed set of context words, same for all target types". Here the 5000 most frequent nav's.↩
This may also be related to the parameter settings that are used, e.g. if no nav's of frequency > 200 occur with the target type in a particular token/observation, this type is not included in the model.↩
Note that while it may be dangerous to use this strategy, it doesn't have to be. We just don't know yet.↩
An alternative strategy may be to semasiologically analyze these variants. Specifically for ding this could be a fruitful approach, because this variant is highly polysemous and is also included as a high-level word in the WordNet-taxonomies. It is not known whether the penis-meaning of ding would show up in such an analysis.↩
We could select high-frequency candidates from the association data of Gert Storms for this purpose.↩
This determines how many times the algorithm can try to find a stable solution. If it doesn't succeed in the specified number of random starts, there is no successful convergence.↩